(57) ABSTRACT

A method for transmitting data in a network from a source 
node to a destination node via a path of links includes the 
steps of transmitting data packets from the source node to an 
intermediary point. Once a particular packet is successfully 
received at an intermediary point, the particular packet is 
de-allocated at the source node, as are any other packets in 
the buffer between the particular packet and the last 
acknowledged packet. Upon receipt of an error indication, 
each packet is retransmitted along with all subsequent 
packets. After a predetermined number of unsuccessful
attempts to transmit the data, it is determined
that the link between one intermediary point and another
intermediary point, or between one intermediary point and
the destination node, has failed. The packets are returned to the source
node and a verification packet is sent across the path of links
to verify that at least one link has failed. Upon verification
that at least one link has failed, an alternate path of links for 
transmitting the packets from the source node to the desti- 
nation node is established. 
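
The sending side of one hop, as summarized above, can be sketched roughly as follows. This is only an illustrative sketch, not the claimed implementation: the names `PointSender`, `LinkFailed`, `on_ack` and `on_nak` are hypothetical, and the retry limit of 3 is an assumed value for the "predetermined number" of attempts.

```python
from collections import OrderedDict


class LinkFailed(Exception):
    """Raised when the assumed retry budget for a link is used up."""


class PointSender:
    """Sending side of one hop: retains each packet until it is acknowledged."""

    def __init__(self, max_attempts=3):
        # "Predetermined number" of attempts; 3 is only an assumed value.
        self.max_attempts = max_attempts
        self.buffer = OrderedDict()  # sequence number -> retained payload
        self.next_seq = 0
        self.failed_attempts = 0

    def send(self, payload):
        """Retain a copy of the payload and return its sequence number."""
        seq = self.next_seq
        self.buffer[seq] = payload
        self.next_seq += 1
        return seq

    def _release_through(self, seq):
        # De-allocate every retained packet numbered seq or lower.
        for s in [s for s in self.buffer if s <= seq]:
            del self.buffer[s]

    def on_ack(self, seq):
        """An ACK for seq also covers all older unacknowledged packets."""
        self._release_through(seq)
        self.failed_attempts = 0

    def on_nak(self, seq):
        """Return (seq, payload) pairs to resend: the NAKed packet onward."""
        self._release_through(seq - 1)  # packets before the NAKed one arrived intact
        self.failed_attempts += 1
        if self.failed_attempts >= self.max_attempts:
            # Retained packets stay in self.buffer so they can be returned.
            raise LinkFailed(f"no success after {self.max_attempts} attempts")
        return [(s, p) for s, p in self.buffer.items() if s >= seq]
```

In this sketch a caller would treat `LinkFailed` as the point at which the retained packets are handed back toward the source node and a verification packet is sent.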

23 Claims, 4 Drawing Sheets 
METHOD AND APPARATUS FOR FAILURE AND RECOVERY IN A COMPUTER NETWORK

RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 60/057,221, filed on Aug. 29, 1997, entitled "Method and Apparatus for Communicating Between Interconnected Computers, Storage Systems, and Other Input/Output Subsystems" by inventors Ahmet Houssein, Paul A. Grun, Kenneth R. Drottar, and David S. Dunning, and to U.S. Provisional Application No. 60/081,220, filed on Apr. 9, 1998, entitled "Next Generation Input/Output" by inventors Christopher Dodd, Ahmet Houssein, Paul A. Grun, Kenneth R. Drottar, and David S. Dunning. These applications are hereby incorporated by reference as if repeated herein in their entirety, including the drawings. Furthermore, this application is related to U.S. patent application Ser. No. 09/141,151 filed by David S. Dunning and Kenneth R. Drottar on even date herewith and entitled "Method and Apparatus for Controlling the Flow of Data Between Servers." This application is also related to U.S. patent application Ser. No. 9/141,134 filed by David S. Dunning, Ken Drottar and Richard Jensen on even date herewith and entitled "Method and Apparatus for Controlling the Flow of Data Between Servers Using Optimistic Transmitter" and U.S. Pat. No. 6,181,704 filed by David S. Dunning, Ken Drottar and Donald Cameron on even date herewith and entitled "Method and Apparatus for Input/Output Link Retry, Failure and Recovery in a Computer Network."

BACKGROUND OF THE INVENTION

The present invention relates generally to methods and apparatuses for controlling the flow of data between nodes (or two points) in a computer network, and more particularly to a method and apparatus for controlling the flow of data between two nodes (or two points) in a system area network.

For the purposes of this application, the term "node" will be used to describe either an origination point of a message or the termination point of a message. The term "point" will be used to refer to a transient location in a transmission between two nodes. The present invention includes communications between either a first node and a second node, a node and a switch, which is part of a link, between a first switch and a second switch, which comprise a link, and between a switch and a node.

An existing flow control protocol, known as Stop and Wait ARQ, transmits a data packet and then waits for an acknowledgment (ACK) before transmitting the next packet. As data packets flow through the network from one point to the next point, latency becomes a problem. Latency results from the large number of links and switches in fabrics which make up the network. This is because each packet requires an acknowledgment of successful receipt from a receiving node before the next data packet is sent from a transmitting node. Consequently, there is an inherent delay due to the transit time for the acknowledgment to reach the transmitting node from the receiving node.

One solution, which is known as Go Back n ARQ, uses sequentially numbered packets, in which a sequence number is sent in the header of the frame containing the packet. In this case, several successive packets are sent without waiting for the return of the acknowledgment. According to this protocol, the receiving node only accepts the packets in the correct order and sends request numbers (RN) back to the transmitting node. The effect of a given request number is to acknowledge all packets prior to the requested packet and to request transmission of the packet associated with the request number. The go back number n is a parameter that determines how many successive packets can be sent from the transmitter in the absence of a request for a new packet. Specifically, the transmitting node is not allowed to send packet i+n before i has been acknowledged (i.e., before i+1 has been requested). Thus, if i is the most recently received request from the receiving node, there is a window of n packets that the transmitter is allowed to send before receiving the next acknowledgment. In this protocol, if there is an error, the entire window must be resent as the receiving node will only permit reception of the packets in order. Thus, even if the error lies near the end of the window, the entire window must be retransmitted. This protocol is most suitable for large networks having high probabilities of error.

In an architecture that permits large data packets, unnecessarily retransmitting excess packets can become a significant efficiency concern. For example, retransmitting an entire window of data packets, each on the order of 4 Gigabytes, would be relatively inefficient.

Other known flow control protocols require retransmission of only the packet received in error. This requires the receiver to maintain a buffer of the correctly received packets and to reorder them upon successful receipt of the retransmitted packet. While keeping the bandwidth requirements to a minimum, this protocol significantly complicates the receiver design as compared to that required by Go Back n ARQ.

The present invention is therefore directed to the problem of developing a method and apparatus for controlling the flow of data between nodes in a system area network that improves the efficiency of the communication without overly complicating the processing at the receiving end.

SUMMARY OF THE INVENTION

The present invention provides a method for transmitting data in a network from a source node to a destination node. According to the method of the present invention, data packets are transmitted from the source node to at least one intermediary point via a path of links. Upon receiving a predetermined number of error indications that at least one of the data packets was not correctly received by a point subsequent to the source node in the transmission path, the data packet is returned to the source node.

The present invention also provides an apparatus for communicating data between two nodes in a network made of multiple links and multiple fabrics. The apparatus includes two switches and a controller. The first switch is disposed in a first fabric, and transmits the data packets from a first node to a second node. Upon receiving data from the first node, the first switch sends an acknowledgment that each packet was successfully received. The controller determines after a predetermined number of error indications that at least one of the links has failed. At this time, the first switch returns the data to the first node. The second switch is disposed in a second fabric, and receives data returned by the first switch and transmits the data packets to the second node via an alternate path.

DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an overall NG I/O link architecture according to one exemplary embodiment of the present invention.

FIG. 2 is a block diagram of an NG I/O architecture for I/O pass through according to one exemplary embodiment of the present invention.
FIG. 3 illustrates the point-based protocol operation according to the present invention.

FIG. 4 illustrates the point-based protocol operation with multiple nodes according to the present invention.

FIG. 5 illustrates a link failure scheme according to the present invention.

DETAILED DESCRIPTION

Architectural Overview

Next Generation Input/Output (NG I/O) Architecture is a general term to describe systems that are based on the concepts of NG I/O and that employ an NG I/O fabric. The NG I/O fabric is the set of wires and switches that allow two NG I/O devices to communicate. The NG I/O fabric is a standard interface designed to connect server nodes into a cluster and to connect various I/O devices, such as storage devices, bridges, and network interfaces. One or more NG I/O "switches," together with a series of links, comprise a "fabric."

An NG I/O link is the wires used to interconnect two points and the accompanying protocol that runs over those wires. An I/O pass through is a method of connecting I/O devices to a computer node, or connecting two computer nodes together, based on load/store memory transactions. An interconnect based on I/O pass through is transparent to the entities at either end of the interconnect. NG I/O (physical) is a minimum set of wires and the protocol that runs on the link that interconnect two entities. For example, the wires and protocol connecting a computer node to a switch comprise a link. NG I/O bundled refers to the capability to connect two or more NG I/O links together in parallel. Such bundled links can be used to gain increased bandwidth or improve the overall reliability of a given link. According to the present invention, a switch is defined as any device that is capable of receiving packets (also referred to as I/O packets) through one or more ports and re-transmitting those packets through another port based on a destination address contained in the packet. In network terms, a switch typically operates at the data link layer of the Open Systems Interconnection (OSI) model.

FIG. 1 illustrates the overall NG I/O link architecture according to an exemplary embodiment of the present invention. The overall NG I/O link architecture can be illustrated as including one or more computers 210 (e.g., servers, workstations, personal computers, or the like), including computers 210A and 210B. The computers 210 communicate with each other via a switched NG I/O fabric that may include a layered architecture, including a network layer 212, a data link layer 214 and a physical layer 216. An NG I/O switch 220 (e.g., including data link and physical layers) interconnects the computers 210A and 210B. Each computer 210 can communicate with one or more I/O devices 224 (224A and 224B) via the NG I/O fabric using, for example, an I/O pass through technique 226 according to the present invention and described in greater detail below. Alternatively, each computer 210 can communicate with one or more I/O devices 224 (224A and 224B) using a distributed message passing technique (DMP) 227. As a result, I/O devices 224 may be remotely located from each computer 210.

FIG. 2 is a block diagram of an NG I/O architecture for I/O pass through according to an embodiment of the present invention. The NG I/O architecture includes a computer 310 and a computer 360, each of which may be a server, workstation, personal computer (PC) or other computer. Computers 310 and 360 operate as host devices. Computers 310 and 360 are each interconnected to I/O systems 318A and 318B via a switched NG I/O fabric 328, including one or more NG I/O links (e.g., NG I/O links 330A, 330B, 330C, 330D). I/O systems 318 can be remotely located from computers 310 and 360.

Computer 310 includes a CPU/memory complex 312 (including a CPU and main memory typically interconnected via a host bus, not shown), an NG I/O host bridge 314, secondary memory 315 (such as a hard disk drive), and a network controller 316. For outbound transactions (e.g., information being sent from computer 310 to an I/O system 318), NG I/O host bridge 314 operates to wrap the host transaction in an NG I/O packet for transmission over the NG I/O fabric 328. For inbound transactions (e.g., information being sent from an I/O system 318 to computer 310), NG I/O host bridge 314 operates to unwrap the data (e.g., the PCI transaction) provided in an NG I/O packet over fabric 328, and then convert the unwrapped data (e.g., the PCI transaction) to a host transaction. Like computer 310, computer 360 includes a CPU/memory complex 362, NG I/O host bridge 364, a secondary memory 365, and a network controller 366. Computer 360 operates in a similar manner to computer 310.

Each I/O system 318 includes an NG I/O to PCI bridge 320, a PCI storage controller 324 coupled to the NG I/O to PCI bridge 320 via a PCI bus 322, and one or more I/O devices 326. (As illustrated in FIG. 2, the A suffix identifies components for I/O system 318A, and the B suffix indicates corresponding components of I/O system 318B). For outbound transactions, the NG I/O to PCI bridge 320 operates to unwrap the data of an NG I/O packet received over the NG I/O fabric 328, and then convert the unwrapped data (e.g., a host transaction or data) to a PCI transaction. Likewise, for inbound transactions, NG I/O to PCI bridge 320 operates to wrap the PCI transaction in an NG I/O packet for transmission over the NG I/O fabric 328 to computer 310.

PCI storage controller 324 operates to control and coordinate the transmission and reception of PCI transactions between PCI bus 322 and I/O devices 326. I/O devices 326 can include, for example, a SCSI storage device, or other I/O device.

While the embodiment of the NG I/O architecture of the present invention illustrated in FIG. 2 includes an NG I/O to PCI bridge 320, it should be understood by those skilled in the art that other types of bridges can be used. For example, generically speaking, bridge 320 can be referred to as a "network to peripheral bridge" for converting network packets to and from a format that is compatible with bus 322 (bus 322 may be a wide variety of types of I/O or peripheral buses, such as a PCI bus). Likewise, PCI storage controller 324 can be generically referred to as a "peripheral storage controller" for any of several types of I/O devices. Therefore, the present invention is not limited to PCI bridges, but rather, is applicable to a wide variety of other I/O buses, such as Industry Standard Architecture (ISA), Extended Industry Standard Architecture (EISA), Accelerated Graphics Port (AGP), etc. PCI is merely used as an example to describe the principles of the present invention. Similarly, NG I/O to host bridge 364 can be generically referred to as a "network to host bridge" because it converts (NG I/O) network packets to and from a host format (host transactions).

FIG. 2 illustrates that an NG I/O fabric 328 can be used to move storage devices out of the server cabinet and place the storage devices remote from the computer 310. Fabric 328 can include one or more point-to-point links between computer 310 and each I/O system 318, or can include a number of point-to-point links interconnected by one or
more switches. This architecture permits a more distributed environment than presently available.

The present invention provides a simple means to create a working network with flow control mechanisms which do not allow for lost data due to congestion and transient bit errors due to internal or external system noise. The present invention uses an approach to flow control that does not require end-to-end or link-to-link credits. Rather, the present invention combines the ability to detect corrupted or out-of-order packets with the ability to retry (resend) any or all packets, to ensure that all data is delivered uncorrupted, without losing any data, and in the order that the data was sent. This is accomplished by assigning a sequence number and calculating a 32 bit Cyclic Redundancy Check (CRC) with each packet and acknowledging (ACK) or negative acknowledging (NAK) each packet.

The present invention assumes a network built out of point-to-point links. Referring to FIG. 3, the minimum sized network 10 is two endpoints 3 and 4 connected via a fabric 15. For simplicity, the two endpoints 3 and 4 in the network are named the source and the destination, respectively, and will be used to describe the present invention, noting that the present invention holds for a network of any size. Fabric 15 includes a switch 13 and links 8 and 9. Link 8 connects the source to switch 13 and link 9 connects the destination with switch 13. As stated above, the NG I/O protocol operates point-to-point 200 and not end-to-end 100 as shown.

The present invention assumes a send queue and receive queue at each endpoint (i.e., at the source, there is a send queue SE1 and a receive queue RE1 and at the destination, there is a send queue SE2 and a receive queue RE2) and a send and receive queue at each link-switch connection in fabric 15 (i.e., at the link-switch connection for link 8, there is a send queue X1 and a receive queue X2 and at the link-switch connection for link 9, there is a send queue X3 and a receive queue X4). The size of the send queue SE1 need not match the size of the receive queue X2, nor does the send queue X1 need to match the size of receive queue RE1. This is also true for send and receive queues to and from destination 4 and the link-switch connection for link 9. In general, send queues will be larger than receive queues (however, this is not required for purposes of the present invention). In this example, the size of send queue SE1 at the source is defined as S1, the size of receive queue RE1 at the source is defined as R1, the size of send queue SE2 at the destination is defined as S2, and the size of receive queue RE2 at the destination is defined as R2. In addition, send and receive queues X1-X4 have sizes defined as LX1-LX4, respectively.

The source 3 is allowed to send up to S1 packets to the receive queue X2 on switch 13. Under congestion-free conditions, packets received at switch 13 will be processed and immediately passed on to destination 4. Referring back to the example in FIG. 3, switch 13 must send back an acknowledgment (ACK) notifying the source that the packets have been received correctly by the link-switch connection for link 8 by acknowledging a sequence number. Each packet has a sequence number that is unique on its link. On any given link, packets must arrive in the order transmitted. On any given link, descriptors are retried in the order they were queued. Note that, as an efficiency improvement to this algorithm, the link-switch for link 8 can ACK multiple packets at one time by ACKing the highest sequence number that has been correctly received, e.g., if the source 3 receives an ACK for packet #9, then receives an ACK for packet #14, packets #10-#13 are also implicitly ACKed. After the link-switch for link 8 sends an ACK that the packets have been received correctly, the link-switch for link 9 sends the packets to destination 4. Destination 4 must send back an acknowledgment (ACK) to the link-switch for link 9 that the data was received correctly. A new set of sequence numbers is assigned to the packets sent from the link-switch for link 9 to the destination.

Transient errors are errors that occur when packets are sent from a sending node to a receiving node. In the event of a transient error due to internal or external system noise, data may be corrupted between the source 3 and the destination 4. The receiver of the packets must calculate the CRC across the data received, and compare it to the CRC appended to the end of the packet. If the calculated CRC and the received CRC match, the packet will be ACKed. If the two CRCs do not match, that packet must be NAKed, again identified by the sequence number. Upon receipt of a NAK, the sender must resend the specified packet, followed by all packets following that packet. For example, if the sender has sent packets up to sequence number 16 but receives a NAK for packet #14, it must resend packet #14, followed by packet #15 and packet #16. Note that ACKs and NAKs can still be combined. Using the example in the previous paragraph, if packet #9 is ACKed, and packets #10-#13 are then received in order and without data corruption, followed by packet #14 with corrupted data, a NAK of packet #14 signifies that packets #10-#13 were received without error, but that packet #14 was received with error and must be resent.

FIG. 4 is a block diagram illustrating NG I/O links according to an embodiment of the present invention. Fabric 400 is connected between nodes A, B and C labeled 401, 402 and 403, respectively. As shown in FIG. 4, a link 411 is disposed between node A and fabric 400, a link 412 is disposed between node B and fabric 400 and a link 413 is disposed between node C and fabric 400. Each link is a bi-directional communication path between two NG I/O connection points in the fabric 400. As shown in FIG. 4, a unidirectional path 431 of link 411 is connected between an output port 422 of node A and an input port 414 of fabric 400 and a unidirectional path 432 is connected between the input port 426 of node A and the output port 428 of fabric 400, thereby providing a bi-directional link.

Referring back to FIG. 4, suppose, for example, nodes A and C desire to communicate with node B by sending packets of data to node B. According to the principles of the present invention, node A can forward packets #1-#3 across link 411 to fabric 400. These packets are assigned a sequence number which indicates the order in which the packets must be received by a receiving point or node. Node C, also wishing to communicate with node B, forwards packets #11-#13 across link 413 to fabric 400. Again, sequence numbers are assigned to these packets to ensure they are received in the order transmitted. Fabric 400 includes at least one switch 410 used to receive the transmitted packets. Switch 410 also assigns a new sequence number to all packets it receives. For instance, switch 410 assigns a new sequence number to packets #1-#3 and #11-#13. The sequence numbers for these packets can be arranged in more than one way as long as they follow the same sequence in which the packets were sent from the transmitter to switch 410. For example, switch 410 can assign new sequence numbers 101-106 to packets #1-#3 and #11-#13, respectively. In addition, new sequence numbers 101-106 can be assigned to packets #1, #11, #2, #12, #3, and #13, respectively. Other assignments are possible without departing from the present invention.
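
The renumbering performed at switch 410 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `Packet`, `ResequencingSwitch` and `receive` are hypothetical names, the inbound in-order check is inferred from the requirement that packets arrive in the order transmitted on a link, and the starting value 101 and link identifiers 411 and 413 are taken from the example in the text.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Packet:
    seq: int       # sequence number, meaningful on one link only
    payload: str


class ResequencingSwitch:
    """Accepts in-order packets per inbound link and renumbers them outbound."""

    def __init__(self, first_seq=101):
        self._out_seq = count(first_seq)  # outbound numbering: 101, 102, ...
        self._last_in = {}                # inbound link -> last accepted seq

    def receive(self, link, packet):
        """Return the renumbered packet, or None if it is out of order.

        A None result stands in for the NAK the switch would send back.
        """
        last = self._last_in.get(link)
        if last is not None and packet.seq != last + 1:
            return None
        self._last_in[link] = packet.seq
        # The source's identity and numbering are not carried forward.
        return Packet(next(self._out_seq), packet.payload)
```

Feeding the switch packets #1, #11, #2, #12, #3, #13 in that arrival order from links 411 and 413 yields outbound numbers 101-106 in the interleaved assignment described above; feeding #1-#3 first and then #11-#13 yields the other assignment.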
According to the principle of the present invention, the identification of the source transmitting the packets is no longer needed. Thus, once the packets are sent from the source to the switch, and acknowledged by the switch, the identification of the source is no longer required. Referring back to the previous example, packets #1-#3 and #11-#13, assigned new sequence numbers 101-106, are then forwarded to node B. At node B, the packets are either ACKed or NAKed. If the packets are acknowledged, then successful data transmission has been completed. In the alternative, if a NAK has been received by the switch from node B, then the switch determines which packets must be resent. According to the features of the present invention, new sequence numbers 101-106 are used to identify the packets. Thus, a NAK for sequence number 104 signifies that the packets represented by sequence numbers 101-103 were received without error but the packets represented by sequence numbers 104-106 were received with error and must be resent.

If congestion in the network occurs, received packets may not be able to immediately make progress through the network. Congestion in a network is the overcrowding of packets across the network. Congestion control and congestion management are two mechanisms available for a network to effectively deal with heavy traffic volumes. Referring back to FIG. 3, when a local buffer space is filled at a receiving queue, additional packets will be lost, e.g., when receive queue X2 fills up, packets that follow will be thrown away. However, given that retry can occur across each point of a network instead of each end, packets being thrown away are relatively simple to recover from. As soon as receive queue X2 starts moving packets out of its receive buffers, it opens up room for additional packets to be received. The receive queue X2 will check the sequence number of the next packet it receives. In the event that source 3 has sent packets that were dropped, the first dropped packet will be NAKed and therefore resent from that packet on.

According to the present invention, the send queue SE1 just keeps sending packets until it is full of packets that have not been ACKed. It must wait for an ACK for those packets before it can reuse those buffers (it needs to be able to retry those packets if necessary).

FIG. 5 illustrates a link failure scheme according to the present invention. The link failure scheme includes nodes A, B, and C labeled 500, 501 and 502, respectively. The scheme further includes two fabrics 510 and 520 and links 1-6 connecting the nodes via the fabrics. Fabric 510 includes switch X and fabric 520 includes switch Y.

Node A can communicate with node B via fabric 510 (i.e., link 1 to link 5) or via fabric 520 (i.e., link 2 to link 6). Node C can also communicate with node B via fabric 510 (i.e., link 3 to link 5) or via fabric 520 (i.e., link 4 to link 6). According to the present invention, the link failure scheme operates according to a point-to-point protocol. Thus, when node A forwards data to switch X and switch X acknowledges receipt of the data, node A de-allocates the retained copy of the transmitted data.

Once the data reaches switch X or switch Y, the data is then transmitted to node B. If node B fails to acknowledge the data, due to error in transmission or link failure, the data is returned to the source node. The failure can be identified by receiving a predetermined number of negative acknowledgments, or none at all, after a predetermined number of attempts, such as three or five, for example. (Other numbers of attempts are possible without straying from the concept of the present invention).

For example, if switch X is unable to receive an acknowledgment from node B, the data is returned to node A. Node A knows that this data has been returned because the address header of the data packet indicates that node A is the source and that node B is the destination. Because node A knows it sent the data to node B, node A temporarily stores the data in memory. In order to verify the reason the data was not delivered to node B, node A sends a verification packet across the path of links to node B. This verification packet will solicit acknowledgments from intermediate switches having operable links and an error indication from the intermediate switch preceding a failed link. Since the verification packet is a special packet used for the purpose of verifying that a link has failed, node A will receive a returned acknowledgment or an error indication from each of the intermediary nodes. If node A receives an acknowledgment from node B, this indicates that there is no failed link and node A retransmits the original data. Alternatively, if node A receives an error indication, then node A determines that at least one link in the path of links has failed.

After determining that at least one link has failed, a controller 590 establishes an alternate path of links for the returned data. The alternate path of links includes a switch not in the same fabric. Thus, as shown in FIG. 5, switch X and switch Y are in separate fabrics. Thus, a path from link 2 to link 6 would be an alternative path of links for the returned data at node A.

This verification process is called the "caboose protocol" and, as stated above, uses point-to-point transmission of data. Moreover, the caboose protocol is used at a point subsequent to the source successfully transmitting the data to an intermediary point. Further, the caboose protocol is applicable for two separate fabrics that are not connected, wherein an alternative path can be established.

There are many advantages of the present invention. For example, the present invention allows for retry of corrupted packets at each point in the network instead of at the source and the destination of the network. According to the present invention, after a first node (source) transmits to and receives an acknowledgment from an intermediate point, the first node is no longer relied upon to resend information that may be corrupted later in the transmission path from other intermediary nodes to the destination. Thus, the retry feature of the present invention simplifies data transmission and makes it more efficient. Since the first node is no longer relied upon to resend data if corruption occurs during transmission, the first node is free to send additional data to other locations or to the same destination after the first node receives an acknowledgment from the first intermediate point.

Additionally, the present invention implements flow control between two points which will yield better bandwidths for link efficiency than a traditional credit based flow control. A credit based scheme stops sending packets when all credits are used up, and transmission cannot resume until additional credits are received. Therefore, in a credit based scheme the time to start and stop data transfer is dependent on the round trip time of the traversing link. The present invention is optimistic in that it sends packets with the expectation that they will be received correctly and is not dependent on the round trip time of the link.

What is claimed is:

1. A method for transmitting data in a network from a
source node to a destination node comprising the steps of:

transmitting data in a plurality of packets from said source
node to at least one intermediary point via a path of
links;

returning at least one of the plurality of packets to the
source node upon receiving a predetermined number of
error indications that the plurality of packets were not
correctly received by a point subsequent to the source 
node in the transmission sequence; and 
retaining a copy of each packet in a buffer at the source 
node until receiving an acknowledgment that said 
packet was successfully received by said intermediary 
point. 

2. The method according to claim 1 further comprising the 
step of transmitting a verification packet verifying at least 
one link has failed. 

3. The method according to claim 2, further comprising 
the step of establishing an alternative path of links for 
transmitting said packets from said source node to said 
destination node. 

4. The method according to claim 1, further comprising 
the step of retaining a copy of each packet in a buffer at the 
intermediate point until receiving an acknowledgment that 
said each packet was successfully received. 

5. The method according to claim 1, wherein the prede- 
termined number of error indications includes one selected 
from the group consisting of three, five, and between three 
and five. 

6. The method according to claim 1, further comprising 
the steps of: 

retransmitting from the intermediate point each packet 
and all subsequent packets upon receipt of an error 
indication; and 

terminating retransmission attempts after a predetermined 
number of attempts have been reached. 

7. The method according to claim 2, wherein the step of
transmitting the verification packet comprises the steps of: 

maintaining a copy of the verification packet in a buffer at
the source node whether receiving an acknowledgment 
or an error indication from a subsequent node; and 

receiving the verification packet returned by a node 
preceding a failed link. 

8. A method for transferring data across a fabric in a 
system area network including a plurality of links using a 
point-to-point protocol, said method comprising the steps of: 

transmitting the data in a plurality of packets via a path of 
links; 

retaining each packet in a buffer at a source node until 
receiving either an acknowledgment indicating that 
said each packet was successfully received or an error 
indication that a received version of said each packet 
included at least one error, while simultaneously trans- 
mitting additional packets; 

using a single negative acknowledgment to indicate that a 
packet associated with the negative acknowledgment 
includes at least one error and to simultaneously indi- 
cate that all previous packets received prior to the 
packet associated with the negative acknowledgment 
were received correctly; 

returning at least one of the plurality of packets to the 
source node upon receiving a predetermined number of 
negative acknowledgments that the at least one of the 
plurality of packets was not correctly received by a 
point subsequent to the source node in the path of links; 
and 

transmitting a verification packet verifying at least one 
link in the path of links has failed. 

9. The method according to claim 8 further comprising the 
step of establishing an alternate path of links for said packet 
from said source node to said destination node. 

10. The method according to claim 8, further comprising 
the step of indicating successful receipt of all packets 
between a last acknowledged packet and a particular packet
by sending a single acknowledgment. 

11. The method according to claim 8, further comprising 
the steps of: 

de-allocating a particular packet in the buffer at the source 
node upon receipt of an acknowledgment associated 
with said particular packet; and 

de-allocating any other packets in the buffer between said 
particular packet and a last acknowledged packet. 

12. The method according to claim 8, further comprising 
the step of de-allocating all buffered packets following a 
packet associated with the negative acknowledgment, and 
retransmitting all packets from the packet associated with 
the negative acknowledgment including the packet associ- 
ated with the negative acknowledgment. 

13. An apparatus for communicating data between two 
nodes in a network having a plurality of fabrics including a 
plurality of links, said apparatus comprising: 

a first switch being disposed in a first fabric, transmitting 
the data in a plurality of packets from a first node to a 
second node, wherein upon successful receipt of each 
packet, the first switch sends an acknowledgment that 
said each packet was successfully received, and trans- 
mits the data to the second node; 

a controller determining, after a predetermined number of 
error indications, at least one of said plurality of links 
has failed, wherein the first switch returns the data to 
the first node; and 

a second switch being disposed in a second fabric, receiv- 
ing from the first node the data returned by the first 
switch and transmitting data packets to the second node 
via an alternate path. 

14. The apparatus according to claim 13, wherein the first 
node de-allocates a packet upon receipt of an acknowledg- 
ment associated with said packet in the buffer in addition to 
all packets preceding said packet in the buffer. 

15. The apparatus according to claim 13, wherein the first 
node retransmits a particular packet and all packets in 
sequence subsequent to the particular packet upon receipt of 
an error indication associated with said particular packet. 

16. The apparatus according to claim 13, wherein said first 
switch and said second switch are independent of each other. 

17. The apparatus according to claim 13, wherein said first 
switch transmits a verification packet verifying said at least 
one link has failed. 

18. The apparatus according to claim 13, wherein said 
controller determines an alternate path for transmission of 
said data. 

19. A program storage device readable by a machine, 
tangibly embodying a program of instructions executable by 
a machine to perform method steps for transmitting data 
between switches in a fabric having a plurality of links, said 
method comprising the steps of: 

transmitting data in a plurality of packets from a source to 
a destination via at least one intermediary switch over 
a path of links; 

returning at least one of the plurality of packets to the 
source upon receiving a predetermined number of 
negative acknowledgments that at least one packet was 
not correctly received by a point subsequent to the 
source in a transmission path; 

transmitting a verification packet verifying at least one 
link has failed; and 

establishing an alternate path of links for transmitting the 
at least one packet from the source to the destination. 

20. The device according to claim 19, wherein the method 
further comprises the step of retaining a copy of each packet 
in a buffer at said intermediate switch until receiving an
acknowledgment that said each packet was successfully 
received. 

21. The device according to claim 19, wherein the method 
further comprises the steps of: 

de-allocating a particular packet in the buffer at the source 
upon receipt of an acknowledgment associated with 
said particular packet from said intermediary node; and 

de-allocating any other packets in the buffer between said
particular packet and a last acknowledged packet. 

22. The device according to claim 19, wherein the method 
further comprises the steps of: 

retransmitting said each packet and all subsequent packets 
upon receipt of an error indication; and 

dropping all received packets following said each packet 
associated with the error indication until successfully 
receiving a retransmitted version of said each packet. 

23. A device for transmitting data between switches in a
fabric having a plurality of links comprising:

means for transmitting data in a plurality of packets from
a source to a destination via at least one intermediary
switch over a path of links;

means for returning at least one of the plurality of packets
to the source upon receiving a predetermined number
of negative acknowledgments that at least one packet
was not correctly received by a point subsequent to the
source in a transmission path;

means for transmitting a verification packet verifying at
least one link has failed; and

means for establishing an alternate path of links for
transmitting the at least one packet from the source to
the destination.
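
The buffering and error-counting behavior recited in the claims above can be illustrated with a minimal sketch. It is an illustration only, not the claimed implementation. The class name `LinkSender`, its method names, the action strings, and the choice to reset the error count on any acknowledgment are all assumptions made for this example. The limit of three error indications is one of the example values given in the text. Sending the verification packet and selecting an alternate path are not modeled.

```python
class LinkSender:
    """Hypothetical sender on one link (source node or intermediary point)."""

    def __init__(self, max_errors=3):
        # "Predetermined number" of error indications; 3 is an example value.
        self.max_errors = max_errors
        self.buffer = {}   # sequence number -> retained copy of the packet
        self.errors = 0    # consecutive error indications (assumed reset on ACK)

    def send(self, seq, payload):
        # Retain a copy until the packet is acknowledged.
        self.buffer[seq] = payload
        return seq, payload

    def _release_through(self, seq):
        # De-allocate the given packet and every earlier buffered packet.
        for s in [s for s in self.buffer if s <= seq]:
            del self.buffer[s]

    def on_ack(self, seq):
        # A single acknowledgment covers all packets up to and including seq.
        self._release_through(seq)
        self.errors = 0

    def on_nak(self, seq):
        # A negative acknowledgment for seq also indicates that all
        # earlier packets were received correctly.
        self._release_through(seq - 1)
        self.errors += 1
        outstanding = sorted(self.buffer.items())  # seq and all later packets
        if self.errors >= self.max_errors:
            # Give up on this link: hand the packets back toward the source.
            return "return_to_source", outstanding
        # Otherwise retransmit the packet and all subsequent packets.
        return "retransmit", outstanding


sender = LinkSender(max_errors=3)
for seq in range(5):
    sender.send(seq, f"packet-{seq}")

sender.on_ack(1)                      # releases packets 0 and 1
for _ in range(3):                    # three error indications for packet 3
    action, packets = sender.on_nak(3)
print(action, [s for s, _ in packets])  # return_to_source [3, 4]
```

In this sketch the first two negative acknowledgments yield a "retransmit" action for packets 3 and 4, and the third yields "return_to_source", at which point a caller would send the verification packet and, if the failure is confirmed, establish the alternate path.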

***** 